Indexing Shared Content in Information Retrieval Systems
Abstract
Modern document collections often contain groups of documents with overlapping or shared content. However, most information retrieval systems process each document separately, causing shared content to be indexed multiple times. In this paper, we describe a new document representation model where related documents are organized as a tree, allowing shared content to be indexed just once. We show how this representation model can be encoded in an inverted index and we describe algorithms for evaluating free-text queries based on this encoding. We also show how our representation model applies to web, email, and newsgroup search. Finally, we present experimental results showing that our methods can provide a significant reduction in the size of an inverted index as well as in the time to build and query it.
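The abstract does not spell out the encoding, so the following is only a minimal sketch of the general idea, under stated assumptions: each node in the tree stores just the text that is new at that node, a document's full content is taken to be the text along its root-to-node path, and a conjunctive free-text query matches a document when every term occurs somewhere on that path. The class `SharedContentIndex` and its methods are hypothetical names chosen for illustration, not the paper's algorithms.

```python
from collections import defaultdict


class SharedContentIndex:
    """Toy inverted index over a tree of documents (hypothetical design)."""

    def __init__(self):
        self.parent = {}                  # node id -> parent id (None for a root)
        self.postings = defaultdict(set)  # term -> ids of nodes whose own text has it

    def add(self, node_id, text, parent=None):
        # Index only the text that is new at this node; text inherited from
        # ancestors is assumed to be indexed already at the ancestor.
        self.parent[node_id] = parent
        for term in text.lower().split():
            self.postings[term].add(node_id)

    def _path(self, node_id):
        # Yield the node itself, then each ancestor up to the root.
        while node_id is not None:
            yield node_id
            node_id = self.parent[node_id]

    def search(self, query):
        # Conjunctive query: every term must occur on the root-to-node path.
        terms = query.lower().split()
        return sorted(
            doc for doc in self.parent
            if all(
                any(n in self.postings.get(t, ()) for n in self._path(doc))
                for t in terms
            )
        )


# An email thread: both replies quote (share) the original message.
index = SharedContentIndex()
index.add("msg1", "meeting moved to friday")
index.add("msg2", "works for me", parent="msg1")
index.add("msg3", "that day is a holiday", parent="msg1")

print(index.search("friday"))          # all three messages contain the shared text
print(index.search("holiday friday"))  # only the reply that adds "holiday"
```

In this sketch the quoted text of `msg1` contributes a single posting per term, even though three documents contain it. The path walk at query time is the simplest possible evaluation strategy; a real system would likely use a more compact encoding of the tree inside the posting lists.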